# Vision Transformer architecture
- **Sapiens Seg 0.6b Bfloat16** (facebook) · Image Segmentation · English · 24 downloads · 0 likes
  Sapiens is a family of Vision Transformer models pre-trained on 300 million 1024×1024-resolution human images, focused on human-centric vision tasks.
- **Sapiens Seg 0.3b** (facebook) · Image Segmentation · English · 48 downloads · 2 likes
  A 0.3B-parameter member of the same Sapiens family of human-centric Vision Transformer models.
- **Vit Base Patch32 224 In21k** (google) · Apache-2.0 · Image Classification · 35.10k downloads · 19 likes
  A Vision Transformer (ViT) pretrained on the ImageNet-21k dataset at 224×224 resolution, suitable for image classification tasks.
- **Dpt Large Ade** (Intel) · Apache-2.0 · Image Segmentation · Transformers · 3,497 downloads · 8 likes
  A Dense Prediction Transformer (DPT) fine-tuned on the ADE20k dataset for semantic segmentation tasks.
- **Dpt Large** (Intel) · Apache-2.0 · 3D Vision · Transformers · 364.62k downloads · 187 likes
  A ViT-based monocular depth estimation model trained on 1.4 million images, suitable for zero-shot depth prediction.
- **Beit Large Finetuned Ade 640 640** (microsoft) · Apache-2.0 · Image Segmentation · Transformers · 14.97k downloads · 14 likes
  A Vision Transformer-based segmentation model that combines self-supervised pre-training (BEiT) with fine-tuning on ADE20k for efficient semantic segmentation.
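As a quick way to try one of the models above, the most-downloaded entry (Dpt Large) can be run through the Hugging Face `transformers` depth-estimation pipeline. This is a minimal sketch: the model id `Intel/dpt-large` is an assumption inferred from the listing (org Intel, model Dpt Large), and the first call downloads the weights, so network access is required.

```python
# Minimal sketch: zero-shot monocular depth estimation with DPT-Large.
# Assumes the model id "Intel/dpt-large" matches the "Dpt Large" entry
# above; requires `pip install transformers torch pillow`.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

# Any RGB image works; a synthetic one is used here for illustration.
image = Image.new("RGB", (384, 384), color=(128, 128, 128))
result = depth_estimator(image)

# result["depth"] is a PIL depth map; result["predicted_depth"] is the
# raw model output as a tensor.
print(result["predicted_depth"].shape)
```

The segmentation models in the list (Dpt Large Ade, Beit Large Finetuned Ade 640 640) can be loaded the same way by switching the pipeline task to `"image-segmentation"`.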